ABSTRACT


We survey and discuss analytic performance models for shared-memory computers. Our purpose is to assess
these models to determine if they are relevant to the task of assisting the Office of Export Administration in
establishing guidelines for exports of computers. The focus of this study is on the assumptions made by modelers
regarding interconnection networks and memory access patterns exhibited by applications. The great majority of
models concentrate on analyzing a small collection of networks. Furthermore, most modelers make strong
assumptions regarding the independence and uniform distribution of memory accesses. The consequences of these
assumptions in the resulting models are noted.
1. Introduction.


The Office of Export Administration of the Department of Commerce has requested the assistance of the
National Computer and Telecommunications Laboratory, NIST, in establishing guidelines for export of
computers. One possible avenue is the use of analytic models to predict performance of computers (alternatives
such as simulation and benchmarking are not treated here). We examine a number of models which have been
proposed for the performance evaluation of parallel computers in which processors share common memory. We
are not concerned here with the details of models; rather, we are interested in the question of whether such
modeling efforts are primarily of interest to theoreticians and machine designers, or whether they can serve as
accurate and reliable predictors of performance of real machines on real applications.

We assume that the memory of a shared-memory computer is partitioned into modules; processors access
the modules via some form of interconnection network. Alternatively, memory may be shared but distributed,
with processor/memory pairs connected by an interconnection network. In either case the main factor in
performance is the efficiency of the network in permitting processors to access memory. In Sections 2 and 3 we
review some classes of interconnection networks. In Sections 4-8 we survey various performance models, and in
Section 9 we return to the question of the value of these models in the present context.

In  passing  we  remark  that  hierarchical  systems,  in  which  processors  are  grouped  into  clusters  sharing  cache 
or  common  memory,  are  not  covered  here. 
2. Interconnection networks.


By interconnection network we refer to any structure connecting a set of sources to a set of sinks. Typically
the sources are processors and the sinks are memory modules; this is the case we will refer to in the following.
Alternatively, however, both sources and sinks could be processor/memory pairs. Four major categories of
networks are:

A.  crossbar 

B.  multistage  network 

C.  multiple  bus 

D.  single  bus 

Crossbars  provide  complete  interconnection;  i.e.  it  is  possible  to  concurrently  connect  any  n sources  to  any 
n sinks.  Their  expense,  however,  is  prohibitive  for  large  n.  Single  buses  are  inexpensive,  and  have  been  heavily 
used  to  date;  however,  they  permit  connection  of  only  one  (source,sink)  pair  at  one  time,  and  hence  are  suitable 
only  for  small  numbers  of  sources.  The  other  two  categories  are  intermediate  with  respect  to  both  expense  and 
connection  capability.  Multiple  bus  systems  consist  of  several  buses  with  each  bus  connected  to  all  sinks.  They 
may  be  complete  (each  bus  connected  to  all  sources)  or  partial  (each  source  connected  to  a subset  of  buses). 
Multistage  networks  have  been  intensively  studied,  and  many  references  can  be  found  in  the  bibliography  (e.g. 
[39], [51], [97]).  It  seems  likely  that  multistage  networks  will  grow  increasingly  important  as  systems  incorporate 
larger  numbers  of  processors.  Thus  we  review  them  in  detail  in  the  next  section. 
3. Taxonomy of multistage networks.

Multistage networks consist of switches which route communications between sources and sinks. The
switches are arranged in several stages, with the sources connected to the inputs of the first stage, and the sinks
connected to the outputs of the last stage. In intermediate stages, the outputs of a stage are connected to the
inputs of the next stage. Communication consists of a request for access by a source to a sink. In practice a
successful request is followed by transmission of data from sink to source, but for simplicity we refer only to
requests for access by sources. Multistage networks can be classified according to features such as the following
(not all terminology is universal):

I. Transmission mode and capability
    A. packet-switching
        i. blocking
            a. blocked request queued
            b. blocked request discarded
        ii. non-blocking
    B. circuit-switching
        i. blocking
            a. queued
            b. rejected
        ii. non-blocking
            a. strict
            b. wide-sense
        iii. rearrangeable

II. Paths from a source to a sink
    A. unique path
    B. multipath

III. Switch control
    A. central
    B. distributed
    C. stage

IV. Synchronicity
    A. synchronous
    B. asynchronous

V. Permutation realization capability

VI. Switch boxes
    A. fan-in / fan-out
        i. rectangular (k x n)
        ii. square (n x n)
    B. functions
        i. interchange
        ii. broadcast
    C. queueing of requests
        i. buffered
        ii. unbuffered
    D. arbitration policy
        i. fair
        ii. biased

VII. Topology
    A. omega
    B. indirect binary n-cube
    C. generalized cube
    D. banyan
    E. delta
    F. baseline
    G. data manipulator
    H. flip
    I. other

Packet-switching networks ([23],[24],[33]) route packets individually from source to sink without establishing
a fixed connection; i.e. links are used and then relinquished. They may be non-blocking; i.e. a packet can
always be routed without conflict. More commonly they are blocking; i.e. two or more packets may arrive on
the input wires of a switch requesting the same output wire. If the network is buffered, packets may be queued;
otherwise one packet will be selected and the rest lost, presumably requiring resubmission.

Circuit-switching networks ([17],[103],[107]) set up a fixed connection between source and sink for the
duration of a message; i.e. links are held rather than relinquished. Again the network may be blocking; but if a
conflict occurs at a box, a partially established circuit may need to be torn down and a retry made by the issuing
source. In [17] the non-blocking case is split: strictly non-blocking means circuits can be constructed at will
without fear of blocking; wide-sense means a particular algorithm must be used to avoid conflicts. Rearrangeable
means that existing circuits can be restructured to accommodate a new circuit. A comparison of circuit and
packet switching is given in [98].

A network generally has at least one path from each source to each sink. Omega networks [64] and
topologically equivalent networks (equivalent under relabeling of sources and sinks) have unique paths. Adding
extra stages to such networks yields multiple paths from source to sink ([23],[24],[44],[60],[65],[98],[100]).
Alternatively, extra paths can be produced by changing from 2 x 2 to k x k switch boxes [84].

The  most  common  control  mechanism  for  switches  is  distributed;  i.e.  each  box  is  self-controlled.  This  is 
connected  with  data-routing  [46].  Often  self-routing  is  employed,  via  destination  tags  ([23],[24],[39],[64],[112]). 
By  inspecting  these  tags,  forwarded  with  packets,  boxes  are  able  to  route  packets  to  their  destination.  Central 
control  is  also  a possibility,  as  is  stage  control  [5]. 

A network may be synchronous, i.e. it may accept packets only at the beginning of memory cycles. A
synchronous packet-switched network can be pipelined: the stages of the network form a pipeline, so that sets of
packets can be present concurrently at all stages. Packets are admitted to the first stage at the start of each cycle.
Pipelining cannot be used with circuit-switching; a circuit occupies all stages simultaneously.

Permutation capabilities refer to the case where the sources all submit requests to different sinks. If an
equal number of sources and sinks are present this forms a permutation. The latter is realizable if the requests
can be routed without conflict. Realization capabilities have been studied for the omega network ([27],[46],[83])
and its equivalents, as well as its single-stage parent, the shuffle-exchange network ([61],[62],[114]). Capabilities
of other networks have been studied ([21],[103],[106]) as well, but the omega-like case is most interesting.
Determinations of capabilities for extra-stage networks are challenging ([1],[44],[65]).
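
The notion of a realizable permutation can be made concrete for the omega network. The following is a minimal sketch, not taken from any of the cited papers. It assumes the standard wiring in which a perfect shuffle precedes each stage of 2 x 2 switches, and in which each switch selects its output from one bit of the destination tag, most significant bit first. The names `omega_route` and `is_realizable` are ours.

```python
def omega_route(source, sink, n_bits):
    """Return the output wire occupied after each stage when routing
    source -> sink through an omega network with 2**n_bits lines."""
    mask = (1 << n_bits) - 1
    position = source
    path = []
    for stage in range(n_bits):
        # Destination-tag bit consulted at this stage (MSB first).
        tag_bit = (sink >> (n_bits - 1 - stage)) & 1
        # Perfect shuffle (rotate left), then the switch sets the low bit
        # of the wire label to the tag bit.
        position = ((position << 1) & mask) | tag_bit
        path.append(position)
    return path


def is_realizable(permutation, n_bits):
    """True if every source i can reach permutation[i] with no two
    requests needing the same output wire at any stage."""
    paths = [omega_route(src, dst, n_bits) for src, dst in enumerate(permutation)]
    for stage in range(n_bits):
        wires = [path[stage] for path in paths]
        if len(set(wires)) != len(wires):
            return False  # two requests collide on one switch output
    return True


if __name__ == "__main__":
    identity = list(range(8))
    bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]
    print(is_realizable(identity, 3), is_realizable(bit_reversal, 3))
```

Under these assumptions the identity permutation on 8 lines passes without conflict, while the bit-reversal permutation does not: sources 0 and 4 both need output wire 0 of the first stage.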

Switch boxes are usually crossbars, connecting each input to each output. In the 2 x 2 case, broadcast
capability is sometimes assumed: an input can be sent to both outputs. Two inputs are never sent to one output.
Also in the 2 x 2 case, there are 4 possible settings: straight (1-1, 2-2), interchange (1-2, 2-1) or broadcast
((1-1, 1-2) or (2-1, 2-2)). This is called four-function capability. If broadcast is excluded it is called a
two-function box.

A conflict occurs at a switch when two or more inputs must be routed to the same output. In a buffered
network a queue provides temporary storage for blocked requests [58]; otherwise one request is selected for
forwarding and other requests must be discarded or resubmitted by the sender. Conflict resolution schemes are
normally fair (random selection) but other arbitration policies are occasionally used ([10],[20],[101]).

The best-known networks are based on the shuffle ([19],[61]), a permutation which sends an integer
whose binary representation is $(i_{n-1}, \ldots, i_1, i_0)$ to the integer with representation
$(i_{n-2}, \ldots, i_0, i_{n-1})$. Individual switches perform, in interchange mode, exchanges, sending
$(i_{n-1}, \ldots, i_1, i_0)$ to $(i_{n-1}, \ldots, i_1, 1 - i_0)$. The omega network [64] is a multistage
shuffle/exchange network connecting N inputs to N outputs. It is composed of $\log_2 N$ stages of N/2
switches, each 2 x 2. Between each pair of stages, routing outputs of a stage to inputs of the
next stage is done by a shuffle. Networks equivalent to the omega include the flip [5], indirect binary n-cube
[89], modified data manipulator and baseline [113], generalized cube [99], etc. All of these connect N sources to
N sinks with a unique path from source to sink; they all have $\log_2 N$ stages with each stage consisting of N/2
switches, each a 2 x 2 box. The differences between them are in the control structures and the precise class of
permutations which they can realize. See ([100],[112],[113]) for more details on these.
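
The shuffle and exchange operations amount to simple bit manipulations on line labels. The sketch below is ours and assumes labels are integers of `n_bits` bits with bit 0 the least significant; the helper `omega_dimensions` simply restates the stage and switch counts given above for N a power of two.

```python
def shuffle(label, n_bits):
    """Perfect shuffle: rotate the n_bits-bit label left by one position."""
    mask = (1 << n_bits) - 1
    return ((label << 1) | (label >> (n_bits - 1))) & mask


def exchange(label):
    """Exchange: complement the least significant bit of the label."""
    return label ^ 1


def omega_dimensions(n_lines):
    """(number of stages, switches per stage) for an N x N omega network
    built from 2 x 2 switches; n_lines must be a power of two."""
    if n_lines < 2 or n_lines & (n_lines - 1):
        raise ValueError("n_lines must be a power of two, at least 2")
    return n_lines.bit_length() - 1, n_lines // 2


if __name__ == "__main__":
    print([shuffle(i, 3) for i in range(8)])
    print(omega_dimensions(8))
```

For example, with 3-bit labels the shuffle sends 100 to 001 and 011 to 110, and applying it three times returns every label to itself.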

The omega-like networks have been generalized in various ways. Deltas [85] use a x b switch boxes to
connect $a^n$ sources to $b^n$ sinks; the 2 x 2 case yields the omega. Banyans [45] are characterized by the
property of a unique path between each (source, sink) pair; they include deltas and in particular omegas (square
SW-banyans). On the other hand, multipath networks, as noted earlier, have been obtained by adding stages while
keeping the switch box, or by expanding switch boxes while retaining the number of stages.
4. Performance models.


As noted in the Introduction, the primary determinant of the performance of shared-memory computers is
their capability for concurrently connecting sources and sinks. A common measure of this capability is the
bandwidth of the interconnection network. Typically bandwidth refers to the number of memory requests
accepted in a memory cycle ([11],[12],[13],[18],[49],[86]) or more generally the number of memory modules
accessed concurrently. In multiple bus systems bandwidth may also be characterized in terms of the number of
active buses in a bus cycle ([77],[78]); typically these characterizations do not conflict since it is assumed that a
transaction between processor and memory can always be completed in a bus cycle. Similarly, if networks have
cycles it is usually assumed that their operation is synchronous with modules. In asynchronous single bus
systems, bus utilization is a measure of performance [3].

Performance models typically ignore the management of multilevel memory. Accesses to private caches by
a processor are generally considered part of computation rather than communication; systems are modeled as a
collection of sources (processors or caches) sending requests (for access) to a collection of sinks (memory
modules). Requests are serviced first by the interconnection network and then, if the request is granted, by the
module. Two forms of contention arise: over the use of the network, and for concurrent access to a module. We
concentrate here on studying the assumptions made by various modelers concerning the issuing of requests by
processors and the disposition of those requests in passage through the network. In particular it is of interest to
note what occurs when two requests conflict.
5. General assumptions.

There are few general assumptions upon which authors concur on this subject. An exception is that
virtually all models assume that processors and modules are homogeneous. Furthermore, a common convention is
that a processor is always in one of three states ([49],[52],[71],[72],[73],[75],[76]):

1.  running,  i.e.  working  in  private  memory. 

2.  waiting  for  access  to  a module. 

3.  accessing  a module. 

However,  many  authors  assume  a processor  is  never  waiting;  this  is  because  it  is  assumed  that  when  a 
processor’s  request  for  memory  access  is  blocked,  it  simply  issues  another  request.  We  return  to  this  later. 
Also,  a few  authors  assume  that  a processor  is  never  running  (i.e.  it  spends  all  of  its  time  issuing  requests). 

Some  authors  define  a measure  supplementary  to  memory  bandwidth,  namely  processor  utilization 
([49],[52],[73],[75],[87]),  i.e.  the  number  of  processors  in  a running  state  during  a memory  cycle.  Of  course 
these  measures  are  interconnected. 

Many authors assume the network and memory modules are synchronous: processors issue requests only at
the beginning of memory cycles ([4],[6],[7],[8],[9],[11],[12],[13],[18],[22],[29],[30],[32],[34],[56],[59],[63],[74],
[76],[79],[80],[82],[85],[86],[90],[92],[93],[96],[104],[105],[109],[110],[118],[119]) or occasionally bus cycles
([75],[77],[78]). In the case of multistage interconnection networks, in the synchronous, packet-switched case
the network is typically pipelined. Less frequently the system is allowed to be asynchronous ([3],[26],[42],
[47],[48],[52],[53],[55],[69],[70],[72],[73],[108],[111],[115]). The asynchronous case arises in particular in
analyses of real-time systems ([3],[47],[48],[69]).

Multistage networks are usually blocking. Packet-switching is often assumed ([20],[22],[23],
[24],[32],[57],[58],[59],[81],[119]) but circuit-switching is also common ([26],[30],[40],[66],[82],[87],[88],
[104],[105],[110]); some models treat both ([31],[56]). In packet-switching, the default is messages consisting of
single packets; but occasionally the multipacket case is considered [31].

Blocking networks may be buffered ([20],[24],[57],[81]) or unbuffered ([22],[59]); sometimes both are
treated ([31],[32],[56],[58],[119]). Buffer size may be a parameter [32]. In buffered systems requests are normally
queued in the event of conflict; but then full buffers are another form of blocking, unless queues are infinite
[81].


Many  analyses  assume  that  processors  have  private  memories;  in  some  cases  the  role  of  private  cache  is 
emphasized  ([15],[16],[35],[36],[53],[54],[67],[79],[81],[87],[88],[115]).  Caches  are  occasionally  shared 
([37], [116],  [117]). 

Other issues such as switch control [32] and data routing ([12],[20],[30],[82]) are occasionally addressed.
Also, in the event of conflict at a switch or over bus usage, normally one processor is selected randomly;
occasionally other arbitration policies are used ([10],[20],[68],[101]). Multistage networks can also be
differentiated on the basis of topology or shape (square, rectangular). However, the most important specifications
(and the ones most frequently made clear) are discussed in the next section.
6. Distribution and treatment of processor requests.

Distribution  of  requests  by  processors  is  a specification  which  heavily  influences  bandwidth.  In  the  event  of 
conflict  between  requests  (e.g.  at  a switch  or  over  bus  usage)  the  disposition  is  also  a major  factor.  A partial 
taxonomy  of  assumptions  made  by  modelers  is  as  follows: 

I. Let $p_{ij}$ be the probability that processor i issues a request to module j, with m modules present. Then
possibilities include:

A. $p_{ij} = 1/m$ for all i and j; this is the uniform case in which requests to modules are
equiprobable ([4],[6],[8],[12],[14],[18],[20],[22],[26],[30],[31],[32],[38],[42],[53],[55],[56],
[59],[63],[70]-[79],[81],[82],[85]-[88],[92],[93],[104],[105],[109],[110],[115],[118],[119]).

B. for some a: for each i, for some $j_i$ we have $p_{ij} = a$ if $j = j_i$ and $(1-a)/(m-1)$ otherwise; i.e.
each processor has a favorite memory ([9],[10],[11],[13],[29],[50],[80]).

C. for some a and k: for all i, $p_{ij} = a$ if $j = k$ and $(1-a)/(m-1)$ otherwise; i.e. all processors
have the same favorite (hot) memory ([50],[90]).

D. for some a: if processor i has just referenced module k, then in the next cycle $p_{ij} = a$ if
$j = k$ and $(1-a)/(m-1)$ otherwise (local reference model) ([96],[101]).

E. the $\{p_{ij}\}$ are independent random variables ([41],[43],[102]).

F. for each j: for some $p_j$, $p_{ij} = p_j$ for all i ([34],[108]).

II.  If  a processor’s  request  is  blocked: 

A. the request is discarded ([7],[9],[12],[13],[14],[22],[26],[29],[56],[59],[77],[78],[80],[85],
[86],[90],[104],[109],[110]).

B.  the  request  is  resubmitted  ([4],[8],[18],[34],[38],[74],[76],[79],[82],[87],[88],[105],[118]). 

C.  A or  B ([30], [66]). 
III. In a synchronous system, in each memory cycle, a processor:


A.  issues  a new  request  with  probability  p,  p fixed,  0 < p < 1 ([4], [7], [9], [10], [12], [13], 
[22], [29], [30], [34], [38], [41], [56], [59], [63], [68], [74]-[78], [82], [85], [86], [104], [109], [110], [118], 
[119]). 

B.  issues  a new  request  with  probability  one  ([8], [18], [32], [96], [101]). 

IV. All requests by processors are independent of both previous requests and also requests issued
concurrently by other processors ([7],[8],[9],[11]-[14],[20],[22],[29],[31],[32],[38],[50],[59],[63],[77]-[80],
[82],[86],[87],[88],[90],[93],[104],[109],[110],[115],[119]).

V.  In  an  asynchronous  system,  time  between  requests  and  time  for  processors  to  access  modules  may 
vary,  but  both  have  the  same  mean  for  all  processors  ([26], [42], [52], [53], [55], [70], [71], [72], [73], 
[75], [108], [115]). 
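
To illustrate how cases (I-A), (I-B) and (I-C) differ in their effect on bandwidth, the sketch below builds the corresponding request-probability matrices and evaluates the expected number of distinct modules requested in one cycle. It is ours rather than any cited author's model. It assumes a crossbar (no network contention), independence (IV), and a fixed per-cycle request probability (III-A), here called `rate`. Taking module i mod m as the favorite of processor i is an arbitrary convention, and all function names are hypothetical.

```python
def uniform_matrix(n_proc, m):
    """Case I-A: every module is equally likely for every processor."""
    return [[1.0 / m] * m for _ in range(n_proc)]


def favorite_matrix(n_proc, m, a):
    """Case I-B: processor i prefers module i % m (our convention); m >= 2."""
    other = (1.0 - a) / (m - 1)
    return [[a if j == i % m else other for j in range(m)] for i in range(n_proc)]


def hot_matrix(n_proc, m, a, k=0):
    """Case I-C: all processors prefer the same (hot) module k; m >= 2."""
    other = (1.0 - a) / (m - 1)
    return [[a if j == k else other for j in range(m)] for _ in range(n_proc)]


def crossbar_bandwidth(p_matrix, rate=1.0):
    """Expected number of distinct modules requested in a cycle, assuming
    independent requests and a contention-free network."""
    m = len(p_matrix[0])
    total = 0.0
    for j in range(m):
        prob_idle = 1.0
        for row in p_matrix:
            # Processor requests module j with probability rate * p_ij.
            prob_idle *= 1.0 - rate * row[j]
        total += 1.0 - prob_idle
    return total


if __name__ == "__main__":
    for name, matrix in [("uniform", uniform_matrix(8, 8)),
                         ("favorite", favorite_matrix(8, 8, 0.8)),
                         ("hot", hot_matrix(8, 8, 0.8))]:
        print(name, round(crossbar_bandwidth(matrix), 3))
```

In the limiting case a = 1, the favorite-memory model with distinct favorites keeps every module busy, while the hot-memory model serves only one request per cycle.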

It is important to observe, as noted in ([8],[38],[79],[82]), that (II-B) and (IV) are contradictory; the two
are nonetheless assumed together because this simplifies subsequent analyses. The same contradiction occurs in
([87],[88]) but is not explicitly noted. Implicitly this issue is raised whenever (II-A) is invoked as well: it is
unrealistic to assume that a processor whose request is discarded will forgive and forget and issue a new
independent request, as noted in ([26],[29],[77],[78],[80],[85],[86]). In other words, (IV) is king; rarely do
authors try to defy it, since the resulting analysis is complicated if they do. For example, (IV) is violated by
(I-D). The latter is reasonable given the well-known spatial and temporal locality of reference exhibited by
uniprocessors; nonetheless authors avoid it scrupulously. In ([49],[81],[82],[91],[96],[101],[110]) it is noted that
one or more of (I-A), (III-B) and (IV) are also typical of assumptions which are unrealistic in practice but which
simplify analysis.

It should be noted that these simplifications are akin to ignoring issues such as data dependency in
parallelizing loops, or branching in instruction lookahead.
7. Further notes.


As might be expected, strong assumptions and simple architectures combine to produce the most tractable
frameworks, and also the easiest to draw conclusions about. For example, authors love crossbars ([9]-[13],
[16],[29],[34],[38],[40],[41],[53],[63],[74],[76],[80],[86],[87],[88],[90],[92],[93],[94],[96],[101],[102],[104],[118]).
However, barring new technological developments (e.g. [95]), large crossbars are unlikely ever to be
implemented because of their prohibitive cost. The crossbar has the unique advantage to both processors and
modelers of eliminating one of the two sources of contention encountered by requests, namely over use of the
network. The implication for modelers is that if this architecture is combined with an appropriate set of
assumptions from the preceding section, the task of modeling becomes trivial. For example, if p processors make
independent, identically distributed requests to m modules with no blocking possible, the expected bandwidth is
simply $m(1 - (1 - 1/m)^p)$. This is an extreme example of how strong assumptions can produce simple results;
but the value of the latter is highly questionable.
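
The closed form for the crossbar can be recovered in two steps. The derivation below is a sketch under the stated assumptions only: each of the p processors issues one request per cycle, to a module chosen uniformly and independently of all other requests, and the bandwidth is the number of distinct modules requested.

```latex
\Pr[\text{module } j \text{ receives no request}] = \left(1 - \frac{1}{m}\right)^{p},
\qquad
E[\text{bandwidth}]
  = \sum_{j=1}^{m} \left[\, 1 - \left(1 - \frac{1}{m}\right)^{p} \right]
  = m\left(1 - \left(1 - \frac{1}{m}\right)^{p}\right).
```

For instance, m = p = 2 gives an expected bandwidth of 1.5, and m = p = 16 gives approximately 10.3.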

Multiple buses are, in general, not as easy to analyze as crossbars, but are still quite tractable
([8],[10],[11],[25],[28],[29],[36],[41],[42],[49],[52],[55],[63],[68],[70],[73],[75],[77],[78],[79],[108],[109],[115]).
However, multiple bus systems are as rare as large crossbars. Single buses ([47],[48],[69],[71],[72],[111]) occur
much more frequently and are relatively easy to analyze; however, it is well known that single buses support
only small numbers of processors.

Multistage  interconnection  networks  are  difficult  to  analyze  because  events  occur  at  each  stage.  They  are 
nonetheless  treated  frequently  ([9]-[13],[22],[26],[30],[31],[32],[40],[43],[56],[57],[59],[60],[66],[67],[81], 
[82],[85]-[88],[104],[105],[110],[119]).  The  role  of  independence  in  simplifying  analyses  should  be  noted  in  this 
regard,  as  we  remarked  in  the  last  section.  Many  analyses  also  exploit  uniqueness  of  paths  from  source  to  sink; 
a few  authors  permit  multipaths  ([23],[24],[82],[110]).  In  [20]  single-stage  networks  are  analyzed;  in  [58]  both 
single-stage  and  multistage  networks  are  considered. 

There  are  few  simple  conclusions  which  are  drawn  from  most  studies.  A random  selection  follows.  In  [13] 
it  is  noted  that  if  processors  have  favorite  modules  that  are  accessed  frequently,  multistage  networks  have  about 
the  same  (high)  bandwidth  as  crossbars.  Adding  extra  stages  produces  similar  effects:  wait  delays  in  queues  are 
reduced  [24].  In  [32]  buffers  of  different  sizes  are  considered,  including  zero;  a conclusion  is  that  diminishing 
gains  are  obtained  from  large  buffers.  An  implication  is  that  an  assumption  of  infinite  buffers  is  not  inaccurate, 
and  simplifies  modeling.  In  circuit  switched  networks,  tearing  down  partial  circuits  in  event  of  a block  may  be 
preferable to holding them [66]; however, the opposite conclusion is reached in [30]. In [31] it is shown that
packet-switching  may  actually  out-perform  circuit-switching  for  long  messages.  In  a multiple  bus  system,  one 
bus  for  two  processors  gives  nearly  the  same  bandwidth  as  a crossbar  (one  bus  per  processor)  ([63], [77], [78]). 
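
The bandwidth of a complete multiple bus system can be computed exactly under the strongest assumptions of Section 6. The sketch below is ours and does not reproduce any cited model. It assumes uniform requests (I-A), discarded blocked requests (II-A), a fixed request probability (III-A, called `rate` here), independence (IV), and that in each cycle at most `buses` of the distinct modules requested can be served. All names are hypothetical.

```python
def distinct_module_distribution(n_proc, m, rate=1.0):
    """dist[x] = probability that exactly x distinct modules are requested
    in one cycle by n_proc independent processors."""
    dist = [1.0] + [0.0] * min(n_proc, m)
    for _ in range(n_proc):
        new = [0.0] * len(dist)
        for x, prob in enumerate(dist):
            if prob == 0.0:
                continue
            # This processor is idle, or requests an already-requested module.
            new[x] += prob * ((1.0 - rate) + rate * x / m)
            # This processor requests a module nobody has requested yet.
            if x + 1 < len(dist):
                new[x + 1] += prob * rate * (m - x) / m
        dist = new
    return dist


def multiple_bus_bandwidth(n_proc, m, buses, rate=1.0):
    """Expected number of requests served per cycle with `buses` buses."""
    dist = distinct_module_distribution(n_proc, m, rate)
    return sum(min(x, buses) * prob for x, prob in enumerate(dist))


if __name__ == "__main__":
    n = m = 16
    for b in (4, 8, 16):
        print(b, round(multiple_bus_bandwidth(n, m, b, rate=0.5), 3))
```

With as many buses as modules the result reduces to the crossbar value, so smaller bus counts can be compared against it for any chosen request rate.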

In  many  cases  the  significance  of  results  is  not  clear.  In  particular,  few  authors  show  that  their  analyses  are 
applicable  even  to  a single  real  application/architecture  pair. 
8. A case study.

A study  in  [56]  illustrates  some  strengths  and  weaknesses  of  the  modeling  efforts  catalogued  in  this  report. 
This  study  has  been  cited  frequently  in  the  literature,  and  it  is  one  of  the  few  papers  on  multistage  networks 
which  contain  simple  conclusions.  In  one  section  the  authors  assume: 

A.  a banyan  network  (multistage  with  unique  path  from  each  source  to  each  sink). 

B.  k x k switches. 

C.  unbuffered. 

D. $k^n$ inputs, $k^n$ outputs and n stages (square banyan).

E.  synchronous  network. 

F. processors act as independent, identically distributed random processes in issuing requests (more
briefly, requests are independent).

G.  a processor  generates,  with  probability  p,  a packet  in  each  cycle. 

H.  a processor’s  packets  are  directed  to  all  modules  with  equal  probability. 

I. when several packets conflict at a switch, one is forwarded and the rest discarded.

As may be seen from Section 6, many of these assumptions are typical, but this particular set is stronger
than most as a group. To explore the consequences, let $p_j$ be the probability that an input line to a switch at
stage j has a packet. Then it follows easily that

$$p_{j+1} = 1 - \left(1 - \frac{p_j}{k}\right)^{k}, \qquad 0 \le j < n,$$

$$p_0 = p.$$

It follows that the probability that a packet is not blocked is $p_n$; if the packet is replaced by a request to
set up a circuit, $p_n$ is the probability that a circuit is established. The simplicity of this result, and its
simultaneous applicability to both circuit and packet switching, raise the question as to whether such bandwidth
calculations have real significance. The answer is yes, but only if an application has the courtesy to exhibit the
required behavior. In particular, requests must be issued independently and be directed randomly to all modules,
and when their requests are discarded, processors must not resubmit their requests and upset the independence
assumption.
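
The stage-by-stage calculation is easily carried out numerically. The sketch below is ours and assumes the recurrence takes the form $p_{j+1} = 1 - (1 - p_j/k)^k$ with $p_0 = p$, which is the usual form for an unbuffered network of k x k switches under assumptions A-I; the function name is hypothetical.

```python
def banyan_stage_probabilities(p, k, n):
    """Return [p_0, p_1, ..., p_n], where p_j is the probability that a
    given line carries a packet after j stages of k x k switches."""
    probs = [p]
    for _ in range(n):
        # A switch output is idle only if none of the k inputs, each busy
        # with probability p_j, selects it (each does so w.p. 1/k).
        probs.append(1.0 - (1.0 - probs[-1] / k) ** k)
    return probs


if __name__ == "__main__":
    # 2 x 2 switches, every processor issuing a packet each cycle.
    for stages in (2, 6, 10):
        print(stages, round(banyan_stage_probabilities(1.0, 2, stages)[-1], 4))
```

With 2 x 2 switches and p = 1, the sequence begins 1, 0.75, 0.609375 and continues to decrease as stages are added.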

It  should  be  noted  that  the  above  work  in  [56]  is  based  on  earlier  work  in  [86].  In  particular,  this  strong  set 
of  assumptions  is  essentially  carried  over  from  [86].  Furthermore,  this  work  is  continued  in  [22]  and  [59];  again 
these  strong  assumptions  are  made.  This  is  an  example  of  a series  of  increasingly  precise  conclusions  being 
drawn  about  a situation  whose  existence  is  debatable. 
9. Conclusions.


In  inspecting  the  modeling  efforts  surveyed  here,  some  interesting  observations  emerge.  For  example: 

1.  Architectures  are  selected  for  modeling  on  a somewhat  random  basis.  There  is  roughly  an  equal 
split  between  crossbar,  multiple  bus  and  multistage,  despite  the  paucity  of  the  first  two  in  present  or 
planned  machines. 

2. Modelers lag behind machine designers. For example, a recent trend has been toward clustered
systems; [2] is one of the few modeling efforts to address this approach to shared-memory architectures.

3.  In  many  instances,  and  often  by  authors’  admissions,  assumptions  about  systems  and  requests  issued 
by  processors  are  made  for  the  express  purpose  of  tractability.  Behavior  of  real  applications  on  real 
machines  is  often  an  extraneous  consideration. 

4. In particular, assumptions made by modelers usually center around independence and random
distribution of requests for memory access, despite the well-known theory of spatial and temporal locality
for uniprocessors which, if extrapolated to parallel machines, would invalidate these assumptions.

5.  Most  analytical  models  of  performance  are  statistical  in  nature,  explaining  the  strong  bias  towards 
independence  and  uniformity.  Models  with  this  orientation  implicitly  assume  a population  of  processes 
lacking  a high  degree  of  interaction  which  would  produce  dependence  and  nonuniformity  in  accessing 
patterns. 

6.  Effects  of  synchronization  are  rarely  considered.  It  has  been  noted  that  synchronization  mechanisms 
such  as  global  locks  and  counters  produce  very  nonuniform  access  patterns. 

It is interesting to note that some of the older studies ([7],[14],[91]) placed considerable emphasis on
connecting models and real application programs, e.g. in regard to storage patterns. For example, it was noted
that low- and high-order interleaving have different effects on locality of reference. More recent studies have
become increasingly abstract, with a consequent weakening of the correlation between models and the behavior
of real programs. Some exceptions should be noted: for example, in [38] a model is developed for synchronous
iterative computations arising in connection with such problems as solution of linear systems and partial
differential equations. This framework is then interconnected with performance evaluation of various possible
interconnection networks. In [27] a study is made of the connection between access capabilities of networks and
applications such as Fast Fourier Transform and grid computations. A particularly interesting exposition is found
in [81], where general models of both program structure and network performance are developed. The authors
also note that many assumptions made by modelers are unrealistic. In particular, they note that many applications
involve considerable synchronization, producing frequent access to a few memory modules. In Section 6 we noted
that treatment of the cases of favorite and hot memories has been nearly nonexistent.

Unfortunately the three preceding papers are among the relatively few instances in which performance
modeling of shared-memory systems and the behavior of actual applications are interconnected. This leaves an
enormous gap to be filled. Much more work needs to be done in examining the profiles of parallel applications
with regard to patterns of memory access. Furthermore, at this point a characterization of "typical"
interconnection schemes does not exist, although as we have noted it appears likely that multistage networks will
be prominent. Pending accumulation of this type of information on applications and architectures it is difficult to
ascertain whether the models discussed here are accurate predictors of performance.

One major inhibitor to date in the accumulation of this needed information is the distribution of
architectures. Most current nonhierarchical shared-memory machines are based on single buses, which do not
support concurrent memory access. On such systems the issue of bandwidth, as defined in Section 4, does not
even arise per se. Thus there is at present an almost total discrepancy between common architectures and modeled
architectures. It appears that this situation will change in the near future, spurred by support from sources such as
the Defense Advanced Research Projects Agency. When shared-memory architectures begin to proliferate,
presumably the preceding information gap will begin to close, and it should become much clearer as to which of
the preceding models, if any, have practical significance. Until then the lack of knowledge of profiles of typical
applications and machines makes it difficult to assess these models with a high degree of confidence.